The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.
Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas
You as a Data scientist at Thera bank need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards
You need to identify the best possible model that will give the required performance
Explore and visualize the dataset. Build a classification model to predict if the customer is going to churn or not Optimize the model using appropriate techniques Generate a set of insights and recommendations that will help the bank
!pip install imblearn
!pip install xgboost
!pip install delayed
Requirement already satisfied: imblearn in c:\users\konto\anaconda3\lib\site-packages (0.0) Requirement already satisfied: imbalanced-learn in c:\users\konto\anaconda3\lib\site-packages (from imblearn) (0.8.0) Requirement already satisfied: scikit-learn>=0.24 in c:\users\konto\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (0.24.2) Requirement already satisfied: joblib>=0.11 in c:\users\konto\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (0.17.0) Requirement already satisfied: scipy>=0.19.1 in c:\users\konto\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.6.1) Requirement already satisfied: numpy>=1.13.3 in c:\users\konto\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.19.2) Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\konto\anaconda3\lib\site-packages (from scikit-learn>=0.24->imbalanced-learn->imblearn) (2.1.0) Requirement already satisfied: xgboost in c:\users\konto\anaconda3\lib\site-packages (1.4.2) Requirement already satisfied: numpy in c:\users\konto\anaconda3\lib\site-packages (from xgboost) (1.19.2) Requirement already satisfied: scipy in c:\users\konto\anaconda3\lib\site-packages (from xgboost) (1.6.1) Requirement already satisfied: delayed in c:\users\konto\anaconda3\lib\site-packages (0.11.0b1) Requirement already satisfied: hiredis in c:\users\konto\anaconda3\lib\site-packages (from delayed) (2.0.0) Requirement already satisfied: redis in c:\users\konto\anaconda3\lib\site-packages (from delayed) (3.5.3)
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To supress warnings
import warnings
warnings.filterwarnings("ignore")
#Manually deleted the first worksheet in order to import only the necessary data
orig=pd.read_csv("C:/Users/Konto/Documents/ESTUDOS/Texas university/Model_Tuning/Project/BankChurners.csv")
# copying data to another varaible to avoid any changes to original data
data =orig.copy()
data.head()
| Clientnr | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
data.shape #Data shape has enough input to make reliable predictions
(10127, 21)
data.info() #Types of data available in the set
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Clientnr 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
data.nunique() # check the number of unique values in each column
Clientnr 10127 Attrition_Flag 2 Customer_Age 45 Gender 2 Dependent_count 6 Education_Level 6 Marital_Status 3 Income_Category 6 Card_Category 4 Months_on_book 44 Total_Relationship_Count 6 Months_Inactive_12_mon 7 Contacts_Count_12_mon 7 Credit_Limit 6205 Total_Revolving_Bal 1974 Avg_Open_To_Buy 6813 Total_Amt_Chng_Q4_Q1 1158 Total_Trans_Amt 5033 Total_Trans_Ct 126 Total_Ct_Chng_Q4_Q1 830 Avg_Utilization_Ratio 964 dtype: int64
data.duplicated().sum() # there are no duplicates on the data
0
data.describe().round(decimals=2).T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Clientnr | 10127.0 | 7.391776e+08 | 36903783.45 | 708082083.0 | 7.130368e+08 | 7.179264e+08 | 7.731435e+08 | 8.283431e+08 |
| Customer_Age | 10127.0 | 4.633000e+01 | 8.02 | 26.0 | 4.100000e+01 | 4.600000e+01 | 5.200000e+01 | 7.300000e+01 |
| Dependent_count | 10127.0 | 2.350000e+00 | 1.30 | 0.0 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 |
| Months_on_book | 10127.0 | 3.593000e+01 | 7.99 | 13.0 | 3.100000e+01 | 3.600000e+01 | 4.000000e+01 | 5.600000e+01 |
| Total_Relationship_Count | 10127.0 | 3.810000e+00 | 1.55 | 1.0 | 3.000000e+00 | 4.000000e+00 | 5.000000e+00 | 6.000000e+00 |
| Months_Inactive_12_mon | 10127.0 | 2.340000e+00 | 1.01 | 0.0 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
| Contacts_Count_12_mon | 10127.0 | 2.460000e+00 | 1.11 | 0.0 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
| Credit_Limit | 10127.0 | 8.631950e+03 | 9088.78 | 1438.3 | 2.555000e+03 | 4.549000e+03 | 1.106750e+04 | 3.451600e+04 |
| Total_Revolving_Bal | 10127.0 | 1.162810e+03 | 814.99 | 0.0 | 3.590000e+02 | 1.276000e+03 | 1.784000e+03 | 2.517000e+03 |
| Avg_Open_To_Buy | 10127.0 | 7.469140e+03 | 9090.69 | 3.0 | 1.324500e+03 | 3.474000e+03 | 9.859000e+03 | 3.451600e+04 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 7.600000e-01 | 0.22 | 0.0 | 6.300000e-01 | 7.400000e-01 | 8.600000e-01 | 3.400000e+00 |
| Total_Trans_Amt | 10127.0 | 4.404090e+03 | 3397.13 | 510.0 | 2.155500e+03 | 3.899000e+03 | 4.741000e+03 | 1.848400e+04 |
| Total_Trans_Ct | 10127.0 | 6.486000e+01 | 23.47 | 10.0 | 4.500000e+01 | 6.700000e+01 | 8.100000e+01 | 1.390000e+02 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 7.100000e-01 | 0.24 | 0.0 | 5.800000e-01 | 7.000000e-01 | 8.200000e-01 | 3.710000e+00 |
| Avg_Utilization_Ratio | 10127.0 | 2.700000e-01 | 0.28 | 0.0 | 2.000000e-02 | 1.800000e-01 | 5.000000e-01 | 1.000000e+00 |
# let's view the statistical summary of the non-numerical columns in the data
# list of all categorical variables
cat_col = data.select_dtypes(include=["object", "category"])
# printing the number of occurrences of each unique value in each categorical column
for column in cat_col:
print(data[column].value_counts())
print("-" * 50)
Existing Customer 8500 Attrited Customer 1627 Name: Attrition_Flag, dtype: int64 -------------------------------------------------- F 5358 M 4769 Name: Gender, dtype: int64 -------------------------------------------------- Graduate 3128 High School 2013 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64 -------------------------------------------------- Married 4687 Single 3943 Divorced 748 Name: Marital_Status, dtype: int64 -------------------------------------------------- Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: Income_Category, dtype: int64 -------------------------------------------------- Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card_Category, dtype: int64 --------------------------------------------------
# Clientnr is unique for each candidate and might not add value to modeling
data.drop(["Clientnr"], axis=1, inplace=True)
all_col = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20, 30))
for i in range(len(all_col)):
plt.subplot(6, 3, i + 1)
sns.histplot(data[all_col[i]], bins = 20, kde=True) # Analysis below
plt.show()
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
all_col = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20,20))
for i, variable in enumerate(numeric_columns):
plt.subplot(6,3,i+1)
plt.boxplot(data[variable],whis=1.5, vert = 0)
plt.tight_layout()
plt.title(variable)
corr = (
data[numeric_columns].corr()
)
f, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(
corr,
cmap="ocean",
annot=True,
fmt=".1f",
vmin=-1,
vmax=1,
center=0,
square=False,
linewidths=0.7,
cbar_kws={"shrink": 0.5},
)
<AxesSubplot:>
Plots = data.iloc[:, 0: ]
sns.pairplot(Plots, diag_kind='kde', hue= "Attrition_Flag")
<seaborn.axisgrid.PairGrid at 0x19ede9fd280>
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(
loc="lower left",
frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
stacked_barplot(data, "Card_Category", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Card_Category All 1627 8500 10127 Blue 1519 7917 9436 Silver 82 473 555 Gold 21 95 116 Platinum 5 15 20 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Education_Level", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Education_Level All 1371 7237 8608 Graduate 487 2641 3128 High School 306 1707 2013 Uneducated 237 1250 1487 College 154 859 1013 Doctorate 95 356 451 Post-Graduate 92 424 516 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Income_Category", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Income_Category All 1627 8500 10127 Less than $40K 612 2949 3561 $40K - $60K 271 1519 1790 $80K - $120K 242 1293 1535 $60K - $80K 189 1213 1402 abc 187 925 1112 $120K + 126 601 727 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Total_Relationship_Count All 1627 8500 10127 3 400 1905 2305 2 346 897 1243 1 233 677 910 5 227 1664 1891 4 225 1687 1912 6 196 1670 1866 ------------------------------------------------------------------------------------------------------------------------
plt.figure(figsize=(16, 8))
plt.title('Credit Limit and Total Transaction Amount per Attrition', size= 15)
sns.scatterplot(y= "Credit_Limit", x="Total_Trans_Amt", hue = 'Attrition_Flag', data=data)
plt.show()
plt.figure(figsize=(16, 8))
plt.title('Average Utilization Rate and Total Transaction Amount per Attrition', size= 15)
sns.scatterplot(y= "Avg_Utilization_Ratio", x="Total_Trans_Amt", hue = 'Attrition_Flag', data=data)
plt.show()
## function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
histogram_boxplot(data, "Total_Amt_Chng_Q4_Q1")
data[data["Total_Amt_Chng_Q4_Q1"] > 2.5]
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 8 | Existing Customer | 37 | M | 3 | Uneducated | Single | $60K - $80K | Blue | 36 | 5 | 2 | 0 | 22352.0 | 2517 | 19835.0 | 3.355 | 1350 | 24 | 1.182 | 0.113 |
| 12 | Existing Customer | 56 | M | 1 | College | Single | $80K - $120K | Blue | 36 | 3 | 6 | 0 | 11751.0 | 0 | 11751.0 | 3.397 | 1539 | 17 | 3.250 | 0.000 |
| 773 | Existing Customer | 61 | M | 0 | Post-Graduate | Married | abc | Blue | 53 | 6 | 2 | 3 | 14434.0 | 1927 | 12507.0 | 2.675 | 1731 | 32 | 3.571 | 0.134 |
# Capping values for amount spent on meat products at next highest value i.e. 984
data["Total_Amt_Chng_Q4_Q1"].clip(upper=2.4, inplace=True)
histogram_boxplot(data, "Total_Amt_Chng_Q4_Q1")
histogram_boxplot(data, "Total_Ct_Chng_Q4_Q1")
data[data["Total_Ct_Chng_Q4_Q1"] > 2.5]
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 12 | Existing Customer | 56 | M | 1 | College | Single | $80K - $120K | Blue | 36 | 3 | 6 | 0 | 11751.0 | 0 | 11751.0 | 2.400 | 1539 | 17 | 3.250 | 0.000 |
| 30 | Existing Customer | 53 | M | 3 | NaN | Married | $80K - $120K | Blue | 33 | 3 | 2 | 3 | 2753.0 | 1811 | 942.0 | 0.977 | 1038 | 25 | 2.571 | 0.658 |
| 113 | Existing Customer | 54 | F | 0 | Uneducated | Married | Less than $40K | Blue | 36 | 2 | 2 | 2 | 1494.0 | 706 | 788.0 | 1.674 | 1305 | 24 | 3.000 | 0.473 |
| 146 | Existing Customer | 41 | F | 2 | Graduate | Single | Less than $40K | Blue | 32 | 6 | 3 | 2 | 2250.0 | 2117 | 133.0 | 1.162 | 1617 | 31 | 2.875 | 0.941 |
| 190 | Existing Customer | 57 | M | 1 | Graduate | Married | $80K - $120K | Blue | 47 | 5 | 3 | 1 | 14612.0 | 1976 | 12636.0 | 1.768 | 1827 | 24 | 3.000 | 0.135 |
| 269 | Existing Customer | 54 | M | 5 | Graduate | Married | $60K - $80K | Blue | 38 | 3 | 3 | 3 | 2290.0 | 1434 | 856.0 | 0.923 | 1119 | 18 | 3.500 | 0.626 |
| 366 | Existing Customer | 36 | F | 4 | Graduate | Married | $40K - $60K | Blue | 36 | 6 | 3 | 3 | 1628.0 | 969 | 659.0 | 0.999 | 1893 | 15 | 2.750 | 0.595 |
| 773 | Existing Customer | 61 | M | 0 | Post-Graduate | Married | abc | Blue | 53 | 6 | 2 | 3 | 14434.0 | 1927 | 12507.0 | 2.400 | 1731 | 32 | 3.571 | 0.134 |
# Capping values for amount spent on meat products at next highest value i.e. 984
data["Total_Ct_Chng_Q4_Q1"].clip(upper=2.4, inplace=True)
histogram_boxplot(data, "Total_Ct_Chng_Q4_Q1")
# Dropping the variables below. Avg open to buy provides the same information as credit limit.
# Total trans ct also provides similar information than Total trans Amt, only in different format.
# Customer age is not so important as Months on book, because the relatioship with the bank is more important for CC analysis.
data.drop(
columns=[
"Avg_Open_To_Buy",
"Total_Trans_Ct",
"Customer_Age",
],
inplace=True,
)
df = data.copy() # create a copy of the file
df['Income_Category'].replace('abc', np.nan, inplace=True)
df.Income_Category.value_counts() # 'abc' is now missing value
Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 $120K + 727 Name: Income_Category, dtype: int64
df.isnull().sum() # three categories with null values
Attrition_Flag 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 1112 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
# defining a list with names of columns that will be used for imputation
reqd_col_for_impute = [
"Education_Level",
"Marital_Status",
"Income_Category"
]
data1 = df.copy()
# we need to pass numerical values for each categorical column for KNN imputation so we will label encode them
Gender = {"M": 1, "F": 2}
data1["Gender"] = data1["Gender"].map(Gender)
Education_Level = {"Graduate": 2, "High School": 1, "Uneducated": 0, "College": 3, "Post-Graduate": 4, "Doctorate": 5}
data1["Education_Level"] = data1["Education_Level"].map(Education_Level)
Marital_Status = {"Married": 0, "Single": 1, "Divorced": 2}
data1["Marital_Status"] = data1["Marital_Status"].map(Marital_Status)
Card_Category = {
"Blue": 0,
"Silver": 1,
"Gold": 2,
"Platinum": 3,
}
data1["Card_Category"] = data1["Card_Category"].map(Card_Category)
Income_Category = {"Less than $40K": 1, "$40K - $60K": 2, "$60K - $80K": 3, "$80K - $120K": 4, "$120K +": 5 }
data1["Income_Category"] = data1["Income_Category"].map(Income_Category)
data1.head()
| Attrition_Flag | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 1 | 3 | 1.0 | 0.0 | 3.0 | 0 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 1.335 | 1144 | 1.625 | 0.061 |
| 1 | Existing Customer | 2 | 5 | 2.0 | 1.0 | 1.0 | 0 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 1.541 | 1291 | 2.400 | 0.105 |
| 2 | Existing Customer | 1 | 3 | 2.0 | 0.0 | 4.0 | 0 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 2.400 | 1887 | 2.333 | 0.000 |
| 3 | Existing Customer | 2 | 4 | 1.0 | NaN | 1.0 | 0 | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 1.405 | 1171 | 2.333 | 0.760 |
| 4 | Existing Customer | 1 | 3 | 0.0 | 0.0 | 3.0 | 0 | 21 | 5 | 1 | 0 | 4716.0 | 0 | 2.175 | 816 | 2.400 | 0.000 |
X = data1.drop(["Attrition_Flag"], axis=1)
y = data1["Attrition_Flag"].apply(lambda x: 1 if x == "Attrited Customer" else 0)
X.shape
(10127, 16)
y.shape
(10127,)
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y # SHUFFLE THE DATA WHEN SPLITTING, sometimes it is ordered
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 16) (2026, 16) (2026, 16)
imputer = KNNImputer(n_neighbors=5)
impute = imputer.fit(X_train)
# Fit and transform the train data
X_train[reqd_col_for_impute] = imputer.fit_transform(X_train[reqd_col_for_impute])
# Transform the train data
X_val[reqd_col_for_impute] = imputer.fit_transform(X_val[reqd_col_for_impute])
# Transform the test data
X_test[reqd_col_for_impute] = imputer.transform(X_test[reqd_col_for_impute])
# Checking that no column has missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
## Function to inverse the encoding
def inverse_mapping(x, y):
inv_dict = {v: k for k, v in x.items()}
X_train[y] = np.round(X_train[y]).map(inv_dict).astype("category")
X_val[y] = np.round(X_val[y]).map(inv_dict).astype("category")
X_test[y] = np.round(X_test[y]).map(inv_dict).astype("category")
inverse_mapping(Gender, "Gender")
inverse_mapping(Education_Level, "Education_Level")
inverse_mapping(Marital_Status, "Marital_Status")
inverse_mapping(Card_Category, "Card_Category")
inverse_mapping(Card_Category, "Income_Category")
cols = X_train.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_train[i].value_counts())
print("*" * 30)
F 3193 M 2882 Name: Gender, dtype: int64 ****************************** Graduate 2365 High School 1612 Uneducated 881 College 651 Post-Graduate 312 Doctorate 254 Name: Education_Level, dtype: int64 ****************************** Married 3057 Single 2561 Divorced 457 Name: Marital_Status, dtype: int64 ****************************** Silver 2136 Gold 1517 Platinum 1020 Name: Income_Category, dtype: int64 ****************************** Blue 5655 Silver 339 Gold 69 Platinum 12 Name: Card_Category, dtype: int64 ******************************
cols = X_test.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_test[i].value_counts())
print("*" * 30)
F 1070 M 956 Name: Gender, dtype: int64 ****************************** Graduate 849 High School 425 Uneducated 300 College 201 Post-Graduate 153 Doctorate 98 Name: Education_Level, dtype: int64 ****************************** Married 963 Single 901 Divorced 162 Name: Marital_Status, dtype: int64 ****************************** Silver 719 Gold 463 Platinum 413 Name: Income_Category, dtype: int64 ****************************** Blue 1876 Silver 119 Gold 26 Platinum 5 Name: Card_Category, dtype: int64 ******************************
cols = X_val.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_val[i].value_counts())
print("*" * 30)
F 1095 M 931 Name: Gender, dtype: int64 ****************************** Graduate 832 High School 442 Uneducated 306 College 204 Post-Graduate 143 Doctorate 99 Name: Education_Level, dtype: int64 ****************************** Married 1024 Single 846 Divorced 156 Name: Marital_Status, dtype: int64 ****************************** Silver 758 Gold 447 Platinum 392 Name: Income_Category, dtype: int64 ****************************** Blue 1905 Silver 97 Gold 21 Platinum 3 Name: Card_Category, dtype: int64 ******************************
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 24) (2026, 24) (2026, 24)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Training Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_train, model.predict(X_train)) * 100
print("{}: {}".format(name, scores))
Cross-Validation Performance: Bagging: 70.28519099947671 Random forest: 65.77603349031921 GBM: 75.20303506017791 Adaboost: 72.63893249607536 Xgboost: 82.16954474097331 dtree: 70.28519099947671 Logistic Regression: 34.3265306122449 Training Performance: Bagging: 97.43852459016394 Random forest: 100.0 GBM: 81.55737704918032 Adaboost: 75.20491803278688 Xgboost: 100.0 dtree: 100.0 Logistic Regression: 34.93852459016394
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
Buidling the SMOTE Model
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE(
sampling_strategy=1, k_neighbors=5, random_state=1
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 976 Before UpSampling, counts of label 'No': 5099 After UpSampling, counts of label 'Yes': 5099 After UpSampling, counts of label 'No': 5099 After UpSampling, the shape of train_X: (10198, 24) After UpSampling, the shape of train_y: (10198,)
for name, model in models:
model.fit(X_train_over, y_train_over)
scores = recall_score(y_train_over, model.predict(X_train_over)) * 100
print("Recall performance for training data after Upsampling {} : {} \n".format(name, scores))
Recall performance for training data after Upsampling Bagging : 99.68621298293783 Recall performance for training data after Upsampling Random forest : 100.0 Recall performance for training data after Upsampling GBM : 96.43067268091782 Recall performance for training data after Upsampling Adaboost : 93.80270641302216 Recall performance for training data after Upsampling Xgboost : 100.0 Recall performance for training data after Upsampling dtree : 100.0 Recall performance for training data after Upsampling Logistic Regression : 75.79917630908021
modelr = XGBClassifier(random_state=1, eval_metric="logloss")
modelr.fit(X_train_over, y_train_over)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=4,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
# Calculating different metrics on train set
log_reg_over_train_perf = model_performance_classification_sklearn(
modelr, X_train_over, y_train_over
)
print("Training performance:")
log_reg_over_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Calculating different metrics on validation set
log_reg_over_val_perf = model_performance_classification_sklearn(
modelr, X_val, y_val
)
print("validation performance:")
log_reg_over_val_perf
validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.955577 | 0.874233 | 0.853293 | 0.863636 |
# creating confusion matrix
confusion_matrix_sklearn(modelr, X_val, y_val)
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 976 Before Under Sampling, counts of label 'No': 5099 After Under Sampling, counts of label 'Yes': 976 After Under Sampling, counts of label 'No': 976 After Under Sampling, the shape of train_X: (1952, 24) After Under Sampling, the shape of train_y: (1952,)
for name, model in models:
model.fit(X_train_un, y_train_un)
scores = recall_score(y_train_un, model.predict(X_train_un)) * 100
print("Recall Performance Under Sampling {} : {} \n".format(name, scores))
Recall Performance Under Sampling Bagging : 99.18032786885246 Recall Performance Under Sampling Random forest : 100.0 Recall Performance Under Sampling GBM : 96.4139344262295 Recall Performance Under Sampling Adaboost : 92.72540983606558 Recall Performance Under Sampling Xgboost : 100.0 Recall Performance Under Sampling dtree : 100.0 Recall Performance Under Sampling Logistic Regression : 75.40983606557377
modelu = XGBClassifier(random_state=1, eval_metric="logloss")
modelu.fit(X_train_un, y_train_un)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=4,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
# Calculating different metrics on train set
log_reg_under_train_perf = model_performance_classification_sklearn(
modelu, X_train_un, y_train_un
)
print("Training performance:")
log_reg_under_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Calculating different metrics on validation set
log_reg_under_val_perf = model_performance_classification_sklearn(
modelu, X_val, y_val
)
print("Validation performance:")
log_reg_under_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.919052 | 0.947853 | 0.677632 | 0.790281 |
# creating confusion matrix
confusion_matrix_sklearn(modelu, X_val, y_val)
Since
had in downsample and upsample a good performance, the tunning will be done on the first two models and consider undersampled data. I will not use XGBoost due to the time consuming process and the model will also be analyzed later for the total data, which will provide enough input.
%%time
# Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2,30),
'min_samples_leaf': [1, 2, 5, 7, 10],
'max_leaf_nodes' : [2, 3, 5, 10,15],
'min_impurity_decrease': [0.0001,0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer)
grid_obj = grid_obj.fit(X_train_un, y_train_un)
# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_estimator.fit(X_train_un, y_train_un)
Wall time: 6min 31s
DecisionTreeClassifier(max_depth=3, max_leaf_nodes=5,
min_impurity_decrease=0.0001, random_state=1)
# Calculating different metrics on train set
dtree_estimator_train = model_performance_classification_sklearn(
dtree_estimator, X_train_un, y_train_un
)
print("Training performance:")
dtree_estimator_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.790984 | 0.936475 | 0.725397 | 0.817531 |
# Calculating different metrics on validation set
dtree_estimator_val = model_performance_classification_sklearn(dtree_estimator, X_val, y_val)
print("Validation performance:")
dtree_estimator_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.686081 | 0.929448 | 0.330786 | 0.487923 |
# creating confusion matrix
confusion_matrix_sklearn(dtree_estimator, X_val, y_val)
%%time
# Choose the type of classifier.
rf_estimator = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100,200,300],
"min_samples_leaf": np.arange(1, 6,1),
"max_features": [0.7,0.9,'log2','auto'],
"max_samples": [0.7,0.9,None],
}
# Run the grid search
grid_obj = GridSearchCV(rf_estimator, parameters, scoring='recall',cv=5)
grid_obj = grid_obj.fit(X_train_un, y_train_un)
# Set the clf to the best combination of parameters
rf_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator.fit(X_train_un, y_train_un)
Wall time: 31min 2s
RandomForestClassifier(max_features=0.9, min_samples_leaf=5, n_estimators=200,
random_state=1)
# Calculating different metrics on train set
rf_estimator_train = model_performance_classification_sklearn(
rf_estimator, X_train_un, y_train_un
)
print("Training performance:")
rf_estimator_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.963627 | 0.977459 | 0.951147 | 0.964123 |
# Calculating different metrics on validation set
rf_estimator_val = model_performance_classification_sklearn(rf_estimator, X_val, y_val)
print("Validation performance:")
rf_estimator_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.896347 | 0.953988 | 0.614625 | 0.747596 |
# creating confusion matrix
confusion_matrix_sklearn(rf_estimator, X_val, y_val)
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1), 'learning_rate': 1, 'n_estimators': 100}
Score: 0.8319727891156463
Wall time: 5min 38s
# building model with best parameters
adb_tuned1 = AdaBoostClassifier(
n_estimators=100,
learning_rate=1,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
)
# Fit the model on training data
adb_tuned1.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
random_state=1),
learning_rate=1, n_estimators=100, random_state=1)
# Calculating different metrics on train set
Adaboost_grid_train = model_performance_classification_sklearn(
adb_tuned1, X_train, y_train
)
print("Training performance:")
Adaboost_grid_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.988148 | 0.955943 | 0.969854 | 0.962848 |
# Calculating different metrics on validation set
Adaboost_grid_val = model_performance_classification_sklearn(adb_tuned1, X_val, y_val)
print("Validation performance:")
Adaboost_grid_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.958539 | 0.846626 | 0.890323 | 0.867925 |
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned1, X_val, y_val)
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.8319727891156463:
Wall time: 2min 3s
# building model with best parameters
adb_tuned2 = AdaBoostClassifier(
n_estimators=100,
learning_rate=1,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
)
# Fit the model on training data
adb_tuned2.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
random_state=1),
learning_rate=1, n_estimators=100, random_state=1)
# Calculating different metrics on train set
Adaboost_random_train = model_performance_classification_sklearn(
adb_tuned2, X_train, y_train
)
print("Training performance:")
Adaboost_random_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.988148 | 0.955943 | 0.969854 | 0.962848 |
# Calculating different metrics on validation set
Adaboost_random_val = model_performance_classification_sklearn(adb_tuned2, X_val, y_val)
print("Validation performance:")
Adaboost_random_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.958539 | 0.846626 | 0.890323 | 0.867925 |
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned2, X_val, y_val)
%%time
#defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')
#Parameter grid to pass in GridSearchCV
param_grid={'n_estimators':np.arange(50,150,50),
'scale_pos_weight':[2,5,10],
'learning_rate':[0.01,0.1,0.2,0.05],
'gamma':[0,1,3,5],
'subsample':[0.8,0.9,1],
'max_depth':np.arange(1,5,1),
'reg_lambda':[5,10]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1, verbose= 2)
#Fitting parameters in GridSeachCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Fitting 5 folds for each of 2304 candidates, totalling 11520 fits
Best parameters are {'gamma': 0, 'learning_rate': 0.2, 'max_depth': 2, 'n_estimators': 50, 'reg_lambda': 5, 'scale_pos_weight': 10, 'subsample': 0.9} with CV score=0.9600575614861329:
Wall time: 1h 4min 55s
# building model with best parameters
xgb_tuned = XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
subsample=0.9,
learning_rate=0.2,
gamma=0,
eval_metric="logloss",
reg_lambda=5,
max_depth=2,
)
# Fit the model on training data
xgb_tuned.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.2, max_delta_step=0,
max_depth=2, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=50, n_jobs=4,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=5,
scale_pos_weight=10, subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None)
#Calculating different metrics
xgb_tuned_model_train_perf=model_performance_classification_sklearn(xgb_tuned, X_train, y_train)
print("Training performance:")
xgb_tuned_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.874239 | 0.980533 | 0.56228 | 0.714712 |
xgb_tuned_model_val_perf=model_performance_classification_sklearn(xgb_tuned, X_val, y_val)
print("Validation performance:")
xgb_tuned_model_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.869694 | 0.97546 | 0.554007 | 0.706667 |
confusion_matrix_sklearn(xgb_tuned, X_val, y_val)
# training performance comparison
models_train_comp_df = pd.concat(
[
dtree_estimator_train.T,
rf_estimator_train.T,
Adaboost_grid_train.T,
Adaboost_random_train.T,
xgb_tuned_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree undersample tunned",
"Random Forrest undersampled tunned",
"AdaBoost Tuned with Grid search",
"AdaBoost Tuned with Random search",
"Xgboost Tuned with Grid search",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Decision Tree undersample tunned | Random Forrest undersampled tunned | AdaBoost Tuned with Grid search | AdaBoost Tuned with Random search | Xgboost Tuned with Grid search | |
|---|---|---|---|---|---|
| Accuracy | 0.790984 | 0.963627 | 0.988148 | 0.988148 | 0.874239 |
| Recall | 0.936475 | 0.977459 | 0.955943 | 0.955943 | 0.980533 |
| Precision | 0.725397 | 0.951147 | 0.969854 | 0.969854 | 0.562280 |
| F1 | 0.817531 | 0.964123 | 0.962848 | 0.962848 | 0.714712 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[
dtree_estimator_val.T,
rf_estimator_val.T,
Adaboost_grid_val.T,
Adaboost_random_val.T,
xgb_tuned_model_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Decision Tree undersample tunned",
"Random Forrest undersample tunned",
"AdaBoost Tuned with Grid search",
"AdaBoost Tuned with Random search",
"Xgboost Tuned with Grid search",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Decision Tree undersample tunned | Random Forrest undersample tunned | AdaBoost Tuned with Grid search | AdaBoost Tuned with Random search | Xgboost Tuned with Grid search | |
|---|---|---|---|---|---|
| Accuracy | 0.686081 | 0.896347 | 0.958539 | 0.958539 | 0.869694 |
| Recall | 0.929448 | 0.953988 | 0.846626 | 0.846626 | 0.975460 |
| Precision | 0.330786 | 0.614625 | 0.890323 | 0.890323 | 0.554007 |
| F1 | 0.487923 | 0.747596 | 0.867925 | 0.867925 | 0.706667 |
# Test performance Decision Tree
dtree_estimator_test = model_performance_classification_sklearn(dtree_estimator, X_test, y_test)
print("Test performance:")
dtree_estimator_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.687562 | 0.944615 | 0.332972 | 0.492382 |
# Test performance Random Forreest
rf_estimator_test = model_performance_classification_sklearn(rf_estimator, X_test, y_test)
print("Test performance:")
rf_estimator_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.89388 | 0.938462 | 0.61 | 0.739394 |
# Calculating different metrics on the test set Adaboost search CV
Adaboost_grid_test = model_performance_classification_sklearn(adb_tuned1, X_test, y_test)
print("Test performance:")
Adaboost_grid_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.957552 | 0.84 | 0.889251 | 0.863924 |
feature_names = X_train.columns
importances = adb_tuned1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances Adaboost search CV")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Calculating different metrics on the test set Adaboost search CV
Adaboost_grid2_test = model_performance_classification_sklearn(adb_tuned2, X_test, y_test)
print("Test performance:")
Adaboost_grid2_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.957552 | 0.84 | 0.889251 | 0.863924 |
feature_names = X_train.columns
importances = adb_tuned2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances Adaboost Randomsearch CV")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Calculating different metrics on the test set Adaboost search CV
xgb_tuned_test = model_performance_classification_sklearn(xgb_tuned, X_test, y_test)
print("Test performance:")
xgb_tuned_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.860809 | 0.969231 | 0.536627 | 0.690789 |
feature_names = X_train.columns
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances XGBoost tunned")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Test performance comparison
models_val_comp_df = pd.concat(
[
dtree_estimator_test.T,
rf_estimator_test.T,
Adaboost_grid_test.T,
Adaboost_grid2_test.T,
xgb_tuned_test.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Decision Tree undersamp. tunned",
"Random Forest undersamp. tunned",
"AdaBoost Tuned with Grid search",
"AdaBoost Tuned with Random search",
"Xgboost Tuned with Grid search",
]
print("Test performance comparison:")
models_val_comp_df
Test performance comparison:
| Decision Tree undersamp. tunned | Random Forest undersamp. tunned | AdaBoost Tuned with Grid search | AdaBoost Tuned with Random search | Xgboost Tuned with Grid search | |
|---|---|---|---|---|---|
| Accuracy | 0.687562 | 0.893880 | 0.957552 | 0.957552 | 0.860809 |
| Recall | 0.944615 | 0.938462 | 0.840000 | 0.840000 | 0.969231 |
| Precision | 0.332972 | 0.610000 | 0.889251 | 0.889251 | 0.536627 |
| F1 | 0.492382 | 0.739394 | 0.863924 | 0.863924 | 0.690789 |
# creating a list of numerical variables
numerical_features = [
"Dependent_count",
"Months_on_book",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
"Credit_Limit",
"Total_Revolving_Bal",
"Total_Amt_Chng_Q4_Q1",
"Total_Trans_Amt",
"Total_Ct_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
]
# creating a transformer for numerical variables, which will apply simple imputer on the numerical variables
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
# creating a list of categorical variables
categorical_features = ["Gender", "Education_Level", "Marital_Status", "Income_Category", "Card_Category",]
# creating a transformer for categorical variables, which will first apply simple imputer and
#then do one hot encoding for categorical variables
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
]
)
# handle_unknown = "ignore", allows model to handle any unknown category in the test data
# combining categorical transformer and numerical transformer using a column transformer
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
],
remainder="passthrough",
)
# remainder = "passthrough" has been used, it will allow variables that are present in original data
# but not in "numerical_columns" and "categorical_columns" to pass through the column transformer without any changes
X = data1.drop(["Attrition_Flag"], axis=1)
Y = data1["Attrition_Flag"].apply(lambda x: 1 if x == "Attrited Customer" else 0)
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
(7088, 16) (3039, 16)
# Creating new pipeline with best parameters
model = Pipeline(
steps=[
("pre", preprocessor),
(
"XGB",
XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
subsample=1,
learning_rate=0.05,
gamma=0,
eval_metric="logloss",
reg_lambda=10,
max_depth=1,
),
),
]
)
# Fit the model on training data
model.fit(X_train, y_train)
Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median'))]),
['Dependent_count',
'Months_on_book',
'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Contacts_Count_12_mon',
'Credit_Limit',
'Total_Revolving_Bal',
'Total_Amt_Chng_Q4_Q1',
'Total_Trans_Amt',
'Total_Ct_C...
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.05,
max_delta_step=0, max_depth=1,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=50,
n_jobs=4, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=10, scale_pos_weight=10,
subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None))])
The Model that best performs in Recall and Accuracy is the XGBoost Gridsearch CV. Recall is above 0.95 and Accuracy above 0.7. XG Boost had good peformance in all data sets: training validation and test. While other models had ups and downs in metrics depending on the data used. Therefore, the model will be more reliable in production.
The Top 10 Parameters in the different models are very similar, they changed positions during the analysis, but most of them remained on the top. These include among others: Total Transaction Amount in the last 12 months, Total transaction count in 4th quarter compared to 1st quarter, The Balance carried from one month to the next (Revolving Balance) .
Since Education, type of card, Marital status and other variables do not seem to have such an effect on credit card holders, the bank should focus on the level of activities of customers on the top 10 features and capitalize investiments to keep the customers that match this profile.